First, we want to import the libraries “ggplot2” and “dplyr” to help with our data management and visualize.
library("ggplot2")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
We then want to read in our hotel bookings dataset from the csv file. A lot of the current formatting needs to be factorized, such as company, hotel type, month, etc.
hotel_bookings = read.csv("hotel_bookings.csv")
hotel_bookings$company = as.factor(hotel_bookings$company)
hotel_bookings$hotel = as.factor(hotel_bookings$hotel)
hotel_bookings$arrival_date_month = as.factor(hotel_bookings$arrival_date_month)
hotel_bookings$arrival_date_year = as.factor(hotel_bookings$arrival_date_year)
hotel_bookings$meal = as.factor(hotel_bookings$meal)
hotel_bookings$country = as.factor(hotel_bookings$country)
hotel_bookings$market_segment = as.factor(hotel_bookings$market_segment)
hotel_bookings$distribution_channel = as.factor(hotel_bookings$distribution_channel)
hotel_bookings$reserved_room_type = as.factor(hotel_bookings$reserved_room_type)
hotel_bookings$assigned_room_type = as.factor(hotel_bookings$assigned_room_type)
hotel_bookings$agent = as.factor(hotel_bookings$agent)
hotel_bookings$reservation_status_date = as.factor(hotel_bookings$reservation_status_date)
hotel_bookings$reservation_status = as.factor(hotel_bookings$reservation_status)
hotel_bookings$customer_type = as.factor(hotel_bookings$customer_type)
hotel_bookings$deposit_type = as.factor(hotel_bookings$deposit_type)
Now that we’ve formatted the data correctly, we can move on to our first analysis, which is looking at the demographics of customers.
#Load the data
hotel_bookings$country <- as.character(hotel_bookings$country)
countryorigin <- hotel_bookings[,c(1,14)]
#Transform 3 digit codes into country names
library(countrycode)
countryorigin$country <- countrycode(countryorigin$country, "iso3c", "country.name", nomatch = NULL)
countryorigin$country <- as.factor(countryorigin$country)
originofRH <- countryorigin[countryorigin$hotel == "Resort Hotel",]
RH <- table(originofRH$country)
barplot((sort(RH, decreasing = TRUE))[1:6], main = "Top 6 Countries of Customer Origin at Resort Hotel", ylab = "# of visitors", col=c("lavenderblush","pink","hotpink", "deeppink", "deeppink4"), cex.names = 0.8)
The Resort Hotel has more foreign customers from UK(#2) and Ireland(#4). Many families from Northern Europe choose to spend their holidays at Iberian peninsula, where has warmer weather comparing to their home countries. We also see visitors from Spain(#3) and France(#5) due to geographical closeness.
originofC <- countryorigin[countryorigin$hotel == "City Hotel",]
CH <- table(originofC$country)
barplot((sort(CH, decreasing = TRUE))[1:6], main = "Top 6 Countries of Customer Origin at City Hotel", ylab = "# of visitors", col=c("lavenderblush","pink","hotpink", "deeppink", "deeppink4"), cex.names = 0.8)
At City Hotel, we see most foreign customers from France, Germany, UK, Spain, and Italy. Those 5 countries are also the 5 countries with the highest GDP in the European Union (before Brexit). Given that a lot of City Hotel customers travel for business purposes, it makes sense that those customers are coming from the bigger economies of EU and come to Portugal to handle their business.
library("maps")
world_map <- map_data("world")
country_names <- levels(countryorigin$country)
bookings <- c()
for(i in 1:length(country_names)){
n = country_names[i]
bookings <- c(bookings, NROW(countryorigin$hotel[countryorigin$country==n]))
if(n=="United States"){
country_names[i] = "USA" #change to USA to match iso content
}
if(n=="United Kingdom"){
country_names[i] = "UK" #change to UK to match iso content
}
}
counts <- data.frame(region = country_names, count = bookings)
booking_by_region <- merge(world_map, counts, sort = FALSE, by = "region", all.x=TRUE)
booking_by_region <- booking_by_region[order(booking_by_region$order), ]
# Plot a map, filling in the countries based on booking amount
ggplot(data=booking_by_region, aes(x=long, y=lat, group=group, fill=count)) + geom_polygon()
We can see that the hotels have seen customers from almost every country in the world, with the odd notable exception of Canada not having any bookings.
world_map <- map_data("world")
country_names <- levels(countryorigin$country)
bookings <- c()
for(i in 1:length(country_names)){
n = country_names[i]
bookings <- c(bookings, NROW(countryorigin$hotel[countryorigin$country==n]))
if(n=="United Kingdom"){
country_names[i] = "UK" #change to UK to match iso content
}
}
counts <- data.frame(region = country_names, count = bookings)
Europe <- c()
europeanUnion <- c("Austria","Belgium","Bulgaria","Croatia","Cyprus", "Albania", "Kosovo","Czech Rep.","Denmark","Estonia","Finland","France", "Germany","Greece","Hungary","Ireland","Italy","Latvia", "Norway", "Lithuania","Luxembourg","Malta","Netherlands","Poland","Montenegro", "Portugal","Romania","Slovakia","Slovenia","Spain", "Serbia", "Macedonia", "Sweden","UK", "Switzerland", "Czech Republic", "Bosnia and Herzegovina")
for(c in europeanUnion){
Europe <- rbind(Europe, world_map[world_map$region==c,])
}
booking_by_region <- merge(Europe, counts, sort = FALSE, by = "region", all.x=TRUE)
booking_by_region <- booking_by_region[order(booking_by_region$order), ]
# Plot a map, filling in the countries based on booking amount
ggplot(data=booking_by_region, aes(x=long, y=lat, group=group, fill=count)) + geom_polygon()
Most of the bookings are from the European countries, especially Portugal and Eastern Europe.
Most customers are coming from Portugal at both Resort Hotel and City Hotel, so there is a high chance that both hotels are based in Portugal. The top 6 countries of origin are all European countries with some variation in rankings among 2 hotels.
It would be helpful for us to determine customer satisfaction & loyalty by looking deep into the repeated guest data and their respective cancellation rate.
repeatedguestportion <- tapply(hotel_bookings$is_repeated_guest, hotel_bookings$hotel, mean, na.rm = TRUE)
round(repeatedguestportion, digits = 3)
## City Hotel Resort Hotel
## 0.026 0.044
Resort Hotel tend to have more repeated guests than City Hotel.
barplot(repeatedguestportion, col=c("blue", "orange") , border =TRUE, axes= TRUE, ylim=c(0,0.05), main = "Percentage of repeated guest by hotel")
Both hotels have relatively low percentage of repeated guests (less than 5%), which projects low satisfaction and loyalty.
hotel_bookings$repeatedFact <- as.factor(hotel_bookings$is_repeated_guest)
temp_t1 <- aggregate(is_canceled ~ repeatedFact + hotel, data = hotel_bookings, mean)
barplot(temp_t1$is_canceled, names.arg = c("City Non-Repeated", "City Repeated", "Resort Non-Repeated", "Resort Repeated"), col=c("dark blue","blue","dark orange","orange"), border =TRUE, axes= TRUE, ylim=c(0,0.5), cex.names = 0.8, main = "Repeated & Non-Repeated Guest Cancelleation rate by hotel")
It looks like repeat guests for both Resort and City hotel have a lower cancellation rate than non-repeat guests. This makes sense as those you repeat bookings probably do so because they trust the hotel and are satisfied with their last visit.
hoteltype <- tapply(hotel_bookings$is_canceled, hotel_bookings$hotel, mean, na.rm = TRUE)
barplot(sort(hoteltype), ylim = c(0,1), col=c("lavenderblush", "deeppink"))
Here we can see that Resort Hotel has a lower overall cancellation rate than City Hotel. This being said, Resort Hotel still has an overall cancellation rate of 35% while City Hotel still has an overall cancellation rate of 40+ %.
hotel_bookings$total_guests <- hotel_bookings$adults + hotel_bookings$children + hotel_bookings$babies
cancelratebyadults <- aggregate(is_canceled ~ adults, data=hotel_bookings, mean)
cancelratebyguests <- aggregate(is_canceled ~ total_guests, data=hotel_bookings, mean)
barplot(cancelratebyadults$is_canceled, col=c("lavenderblush","pink","hotpink", "deeppink", "deeppink4"), names.arg = cancelratebyadults$adults, xlab="Total Number of Adults", ylab="Cancellation Rate", border =TRUE, axes= TRUE, ylim=c(0,1))
barplot(cancelratebyguests$is_canceled, col=c("lavenderblush","pink","hotpink", "deeppink", "deeppink4"), names.arg = cancelratebyguests$total_guests, xlab="Total Number of Guests", ylab="Cancellation Rate", border =TRUE, axes= TRUE, ylim=c(0,1))
According to this chart, there seems to be a relationship between the total number of guests on each booking and the cancellation rate. One reason that might explain this relationship is that larger groups are difficult to organize and have a higher chance of one member deciding that they no longer want to go on the trip.
cancelratebyprevcancellations <- aggregate(is_canceled ~ previous_cancellations, data=hotel_bookings, mean)
barplot(cancelratebyprevcancellations$is_canceled, col=c("lavenderblush","pink","hotpink", "deeppink", "deeppink4"), names.arg = cancelratebyprevcancellations$previous_cancellations, xlab="Total Number of Previous Cancellations", ylab="Cancellation Rate", border =TRUE, axes= TRUE, ylim=c(0,1))
According to the chart, there seems to be a relationship between the number of previous cancellations and the cancellation rate. On average, guests with a higher number of previous cancellations seem to have higher rates of cancellations. One possible reason for this occurrence is that individual guests may have a habit of repeatedly canceling reservations. One anomaly that we saw was that guests with 1 previous cancellation had a high level of cancellation rates, which we didn’t expect. This may possibly be due to the data set selection process.
hotel_bookings$corporateornot <- ifelse(hotel_bookings$distribution_channel == "Corporate", "Corporate", "No")
cancelratebycorporate <- aggregate(is_canceled ~ corporateornot, data=hotel_bookings, mean)
barplot(cancelratebycorporate$is_canceled, col=c("lavenderblush","pink","hotpink", "deeppink", "deeppink4"), names.arg = cancelratebycorporate$corporateornot, xlab="Booking Type", ylab="Cancellation Rate", border =TRUE, axes= TRUE, ylim=c(0,1))
Corporate bookings tend to cancel at a much lower rate than non corporate bookings. A possible reason for this is that corporate bookings are paid for by the companies who are less likely to take the time to cancel individual bookings.
canceledbydistchannel <- tapply(hotel_bookings$is_canceled, hotel_bookings$distribution_channel, mean, na.rm = TRUE)
barplot(sort(canceledbydistchannel, decreasing = TRUE),ylab = "Cancellation Rate",xlab = "Distribution Channel", ylim = c(0,1), col=c("lavenderblush","pink","hotpink", "deeppink", "deeppink4"))
canceled <- hotel_bookings[hotel_bookings$is_canceled == 1, ]
p<-ggplot(canceled, aes(x=distribution_channel, y=is_canceled)) +
geom_bar(stat="identity", fill="pink") +theme_minimal()
p
From the graph above, we can see that the “Undefined” distribution channel has the highest cancellation rate.However, when we look at quantity canceled for the Undefined channel in the second graph, we see that there are very few total “Undefined” cancellations despite the rate being very high. Aside from this, the channel with the highest cancellation rate is Travel Agents and Tour Operations.This distribution channel also has the highest total number of cancellations.
marketseg <- tapply(hotel_bookings$is_canceled, hotel_bookings$market_segment, mean, na.rm = TRUE)
barplot(sort(marketseg, decreasing = TRUE), ylim = c(0,1), legend = rownames(sort(marketseg, decreasing = TRUE)), col=c("lavenderblush","mistyrose", "pink", "lightpink","hotpink", "deeppink", "deeppink4", "maroon4"), las = 2)
Here, we can see that the “Undefined” market segment has almost a 100% cancellation rate while the “Complementary” segment has the lowest rate at around 20%.
custtype <- tapply(hotel_bookings$is_canceled, hotel_bookings$customer_type, mean, na.rm = TRUE)
barplot(sort(custtype), ylim = c(0,1), col=c("red","blue","green","pink"))
From this barplot we can see which types of customers are more likely to cancel across all customer types. We can see that “group” customers have the lowest cancellation rate (around 10%) while “transient” customers have the highest (around 40%).
mealeffect <- tapply(hotel_bookings$is_canceled, hotel_bookings$meal, mean, na.rm = TRUE)
barplot(sort(mealeffect), ylim = c(0,1), col=c("lavenderblush","pink","hotpink", "deeppink", "deeppink4"), xlab="Meal Plan Option", ylab="Cancellation Rate")
In general guests with the FB meal plan tend to have the highest rates of cancellation, but there isn’t a significant difference in cancellation rates with all the other meal plan options.
assroomtype <- tapply(hotel_bookings$is_canceled, hotel_bookings$assigned_room_type, mean, na.rm = TRUE)
barplot(sort(assroomtype), ylim = c(0,1), col=c("lavenderblush","pink","hotpink", "deeppink", "deeppink4"))
Reservations that were assigned room “P” and “L” had a cancellation rate of almost 100%, while those assigned room “I” had a almost 0% cancellation rate.
reservedroomtype <- tapply(hotel_bookings$is_canceled, hotel_bookings$reserved_room_type, mean, na.rm = TRUE)
barplot(sort(reservedroomtype), ylim = c(0,1), col=c("lavenderblush","pink","hotpink", "deeppink", "deeppink4"), xlab="Reserved Room Type", ylab="Cancellation Rate")
Reservations of the “P” room type are the most likely to be canceled, while reservations of the “E” room type are least likely to be canceled.
repeatedguest <- tapply(hotel_bookings$is_canceled, hotel_bookings$is_repeated_guest, mean, na.rm = TRUE)
barplot(sort(repeatedguest), ylim = c(0,1), col=c("lavenderblush","pink","hotpink", "deeppink", "deeppink4"), names.arg = c("Yes", "No"), xlab="Are They A Repeated Guest?", ylab="Cancellation Rate")
Repeated guests have a significantly lower cancellation rate than not repeated guests. One reason for this may be that the hotels and resorts have really great customer loyalty and customer service, which encourages guests to come back.
ggplot(data=hotel_bookings, aes(x=is_canceled, y = lead_time, colour=is_canceled)) + geom_jitter(alpha = 0.25)
Based on this jitter plot, we can see that there is a correlation between cancellation rate and lead time. It seems that as lead time to booking increases, the chance of cancellation also increases. This is depicted by the larger density of points in the 300-700 days for the canceled (1) bar versus the not-canceled (0) bar.
adr1000 <- hotel_bookings[hotel_bookings$adr<1000,]
ggplot(data=adr1000, aes(x=is_canceled, y = adr, colour=is_canceled)) + geom_jitter(alpha = 0.25)
Based on this jitter plot, we can not make any conclusions about the correlation between ADR and cancellation rate.
deposittype <- tapply(hotel_bookings$is_canceled, hotel_bookings$deposit_type, mean, na.rm = TRUE)
barplot(sort(deposittype), ylim = c(0,1), col=c("lavenderblush","pink","hotpink", "deeppink", "deeppink4"))
summary(hotel_bookings$deposit_type)
## No Deposit Non Refund Refundable
## 104641 14587 162
From the bar plot we can see that non-refund deposit types have almost a 100% cancellation rate, while refundable and non-refundable deposit types have close to 20% and 30% cancellation rate. The sample size for the non-refund category is not small either (14,000+), making this discovery somewhat surprising as we would expect lower cancellation rates for deposit types that are not refundable. This is an insight we can take into account when developing our deposit type strategy structure.
#number of special requests
aggregate(is_canceled ~ total_of_special_requests, data=hotel_bookings, mean)
## total_of_special_requests is_canceled
## 1 0 0.4772036
## 2 1 0.2202492
## 3 2 0.2209885
## 4 3 0.1786143
## 5 4 0.1058824
## 6 5 0.0500000
numofspecialreqs <- aggregate(is_canceled ~ total_of_special_requests, data=hotel_bookings, mean)
barplot(numofspecialreqs$is_canceled, ylim = c(0,.5), names.arg = numofspecialreqs$total_of_special_requests,col=c("lavenderblush","pink","hotpink", "deeppink", "deeppink4"))
Based on this barplot, we see that as number of special requests increase, the rate of cancellations decrease. This makes sense because as more and more special requests are made, it is less likely that they will cancel for another booking because they would have to re-set-up those special requests. Also, it could be explained by the fact that the customer becomes more invested as number of special requests increase.
hotel_bookings$totalstaylength <- hotel_bookings$stays_in_week_nights + hotel_bookings$stays_in_weekend_nights
ggplot(data=hotel_bookings, aes(x=totalstaylength, y=total_of_special_requests, colour=is_canceled)) + geom_point(size=2, alpha=.5) + geom_smooth(method = "lm") + geom_jitter()
## `geom_smooth()` using formula 'y ~ x'
From the graph above, there are two conclusions that can be made. First, shorter stays seem to be linked with more special requests. Second, longer stays seem to have more cancellations than shorter stays. Based on this information, medium length stays should be targeted as special requests are lower and there also tends to be less relative cancellations.
Since a lot of bookings are through companies or agents, we wanted to analyze which agents and companies had a lot of cancellations and thus should be punished, and which have a lot of revenue generated and should be rewarded.
group_by(hotel_bookings, agent) %>% filter((n() >= 2000) & (agent != "NULL")) %>% ggplot(aes(x=agent)) + geom_bar(fill="steelblue")
It appears that agent 9 has by far the most number of bookings, with agent 240 following second. This suggests rewards or special accomodations might be warranted to get their continued business.
#prep data for other sections
agent_revenue = c()
agent_total = c()
agent_cancel = c()
agent_adr = c()
agents <- levels(hotel_bookings$agent)
for (a in agents){
if(a!="NULL"){
num_count = NROW(hotel_bookings$agent[hotel_bookings$agent==a])
agent_adr <- c(agent_adr, sum(hotel_bookings$adr[hotel_bookings$agent == a & hotel_bookings$is_canceled==0])/num_count)
agent_revenue <- c(agent_revenue,sum(hotel_bookings$adr[hotel_bookings$agent == a & hotel_bookings$is_canceled==0]*(hotel_bookings$stays_in_weekend_nights[hotel_bookings$agent == a & hotel_bookings$is_canceled==0] +hotel_bookings$stays_in_week_nights[hotel_bookings$agent == a & hotel_bookings$is_canceled==0])/ num_count))
agent_total <- c(agent_total,sum(hotel_bookings$adr[hotel_bookings$agent == a & hotel_bookings$is_canceled==0] *(hotel_bookings$stays_in_weekend_nights[hotel_bookings$agent == a & hotel_bookings$is_canceled==0] +hotel_bookings$stays_in_week_nights[hotel_bookings$agent == a & hotel_bookings$is_canceled==0])))
agent_cancel <- c(agent_cancel, NROW(hotel_bookings$agent[hotel_bookings$agent == a & hotel_bookings$is_canceled==1])/num_count)
}
}
agent_table = data.frame(agents = agents[1:length(agent_revenue)], agent_revenue = agent_revenue, agent_total = agent_total, agent_cancel =agent_cancel, agent_adr= agent_adr)
top_n(agent_table, n=5, agent_cancel) %>% ggplot(., aes(x=agents, y=agent_cancel)) + geom_bar(stat='identity' ,fill="red")+xlab("Agent ID")+ylab("Cancellation Rate")+ggtitle("Cancel Rate by Agent")
All of the above agents have 100% cancellation rate, so perhaps it might be worthy to look into why they have cancelled so much. Many agents have 0% cancellation, so it may not be worth to have a reward for those.
top_n(agent_table, n=5, agent_total) %>% ggplot(., aes(x=agents, y=agent_total)) + geom_bar(stat='identity' ,fill="maroon")+xlab("Agent ID")+ylab("Total Revenue Generated")+ggtitle("Total Revenue by Agent")
The total revenue generated is very closely related to the agents who have the most bookings–agent 9 has the most revenue generated with 240 as the second highest.
top_n(agent_table, n=5, agent_revenue) %>% ggplot(., aes(x=agents, y=agent_revenue)) + geom_bar(stat='identity', fill="steelblue")+xlab("Agent ID")+ylab("Average Revenue per Trip")+ggtitle("Average Trip Cost by Agent")
A better metric might to look at which agents generate the most expensive trips. The agents above each book trips of value near 2000 on average. Thus, it might be useful to suggest higher end packages to these agents and their clients.
Similar to agents, companies that cancel a lot or generate a lot of revenue should be punished and rewarded, respectively.
group_by(hotel_bookings, company) %>% filter((n() >= 200) & (company != "NULL")) %>% ggplot(aes(x=company)) + geom_bar(fill="steelblue")+xlab("Company ID")+ylab("Number of Bookings")+ggtitle("Bookings per Company")
It appears that companies 40 and 223 have the two most number of bookings. Perhaps these companies are tour companies, or if they are businesses booking for their employees, it may be beneficial to establish rewards programs so all their employees book here in the future.
#setup code
company_revenue = c()
company_total = c()
company_cancel = c()
company_adr = c()
companies <- levels(hotel_bookings$company)
for (a in companies){
if(a!="NULL"){
num_count = NROW(hotel_bookings$company[hotel_bookings$company==a])
company_adr <- c(company_adr, sum(hotel_bookings$adr[hotel_bookings$company == a & hotel_bookings$is_canceled==0])/num_count)
company_revenue <- c(company_revenue,sum(hotel_bookings$adr[hotel_bookings$company == a & hotel_bookings$is_canceled==0] * (hotel_bookings$stays_in_weekend_nights[hotel_bookings$company == a & hotel_bookings$is_canceled==0] + hotel_bookings$stays_in_week_nights[hotel_bookings$company == a & hotel_bookings$is_canceled==0])/ num_count))
company_total <- c(company_total,sum(hotel_bookings$adr[hotel_bookings$company == a & hotel_bookings$is_canceled==0] * (hotel_bookings$stays_in_weekend_nights[hotel_bookings$company == a & hotel_bookings$is_canceled==0] + hotel_bookings$stays_in_week_nights[hotel_bookings$company == a & hotel_bookings$is_canceled==0])))
company_cancel <- c(company_cancel, NROW(hotel_bookings$company[hotel_bookings$company == a & hotel_bookings$is_canceled==1])/num_count)
}
}
company_table = data.frame(company = companies[1:length(company_revenue)], company_revenue = company_revenue, company_total = company_total, company_cancel = company_cancel, company_adr = company_adr)
top_n(company_table, n=5, company_cancel) %>% ggplot(., aes(x=company, y=company_cancel)) + geom_bar(stat='identity' ,fill="red") + ylab("Cancellation Rate") + ggtitle("Cancel Rate by Company") +xlab("Company ID")
All of the above companies have 100% cancellation rate, so perhaps it might be worthy to look into why they have canceled so much, and possibly prevent their bookings in the future.
top_n(company_table, n=5, company_total) %>% ggplot(., aes(x=company, y=company_total)) + geom_bar(stat='identity' ,fill="maroon") + ylab("Total Revenue")+xlab("Company ID")+ggtitle("Total Revenue by Company")
Company 223 has by far the most generated revenue, and 40 follows second. These were the same 2 that had the most bookings, but 223 was second to 40. This likely suggests that 223 has more expensive bookings, while 40 has a lot of revenue through sheer volume of bookings.
top_n(company_table, n=5, company_revenue) %>% ggplot(., aes(x=company, y=company_revenue)) + geom_bar(stat='identity', fill="steelblue")+ ylab("Average Revenue per Booking")+xlab("Company ID")+ggtitle("Average Booking Revenue by Company")
Above show the companies who book expensive trips on average. Copmany 368 by far has the most expensive at over 3000 on average per trip. One opportunity may be to give discounts or other incentives to continue gaining the high source of income per booking, or look into advertising higher amenities.
Many agents and companies have 100% cancellation rate, which should certainly be investigated as this directly impacts profits. Countless other agents and companies have perfect book rate, which should conversely be rewarded. Agents 9 and 240 have a large amount of booking and have generated lots of revenue, worthy of special notice, as does companies 223 and 40 for the same reasons. One assumption made was that ADR already factored in discounts for duration, so only the total value of the trip was important rather than how expensive per day. So if the goal is to have consistent revenue, this analysis showed promising targets; however, if the goal is maximal revenue per trip, the graphs below only consider ADR.
top_n(company_table, n=5, company_adr) %>% ggplot(., aes(x=company, y=company_adr)) + geom_bar(stat='identity' ,fill="steelblue") + ylab("Average ADR")+xlab("Company ID")+ggtitle("Highest ADR by Company")
We can see here the companies with the highest average ADR, so these trips might be for more wealthy business class trips and thus should get advertised higher end packages.
top_n(agent_table, n=5, agent_adr) %>% ggplot(., aes(x=agents, y=agent_adr)) + geom_bar(stat='identity', fill="steelblue")+xlab("Agent ID")+ylab("Average ADR per Trip")+ggtitle("Average ADR by Agent")
Here, the agents with the highest average ADR are likely booking trips for more wealthy individuals and thus should get advertised higher end packages.
After our analysis, we have identified several interesting insights about hotel bookings and cancellation rates. The two hotels are not direct competitors because they have different customers and positions. Resort Hotel is more family-oriented, serving the visitors spending their vacation, while City Hotel serves business travelers more. Moving forward, if both Resort Hotel and City Hotel want to lower customer cancellation rates to secure a stable demand, it can focus on customer groups that are repeated guests, corporate customers, and direct bookings, which all have lower cancellation rates compared to their counterparts.
In terms of market segment and distribution channels, the hotels should look into the undefined groups as they have an almost 100% cancellation rate. It also may be beneficial to look into why those that are assigned and reserved room type “P” cancel at almost a 100% rate as well. On top of this, customers in the transient category (short stays) had the highest cancellation rate at around 40%. This is expected as short stays tend to correlate with more unpredictable agendas.
We found that ADR is not a good indicator of cancellation rate, so a new pricing scheme or price adjustments should not be needed in order to decrease cancellations.
One unsurprising factor that we found was that guests with a higher number of previous cancellations seem to cancel at higher rates than guests with a lower number of previous cancellations. In addition, guests with the FB meal plan tend to have higher rates of cancellation compared to the other meal plans, so it’s worth looking into to see if there is an underlying reason for this trend.
For lead time, we discovered that at greater lead times (300-700 days), the customer is more likely to cancel. From this and our jitter plot, we can say that as lead time decreases (customers book closer to their stay date) it is less likely that they will cancel.
One surprising insight that we found was related to cancellation rates and deposit types. We found that bookings that did not require deposits had a much lower cancellation rate than those that had a non-refundable deposit. The bookings that required non-refundable deposits actually had a almost 100% cancellation rate, while those bookings that did not require deposits had closer to a 30% cancellation rate. The hotels should look into which bookings require deposits and which ones don’t to further narrow down why those that require a non-refundable deposit have an extremely high cancellation rate.
One factor that we found had some level of correlation with cancellation rates was the total stay length. While shorter stays obviously still had cancellations, we found that canceling was more common among longer stays. On the flip side, shorter stays often brought pickier customers, as special requests tended to be higher for these stays. In order to minimize cancellations while also maximizing occupancy rates, hotels should try incentivizing medium length stays, as they tend to have lower special requests and cancellation rates.
When looking at the companies and agents who book frequently and generate a lot of revenue, we identified several agents and companies who have a 100% cancellation rate, indicating that some form of preventative action might be needed. We also identified frequent bookers and which ones spend more lavishly, so targeted advertising might be in order to increase profits with those agents and companies.
Overall, while we conducted our analysis with the assumption that cancellations were bad, cancellations could actually be beneficial if the right steps are taken. If the hotels require non-refundable deposits for bookings and then ultimately cancel, then that is added revenue towards the hotels. We can take the data that we collected to attach non-refundable deposits to those groups that have the highest rate of cancellations. The hotel can also utilize overbooking to maximize revenue generated and dampen the effects of cancellations.